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The Cray Research CRAY-1 computer is a powerful general purpose computer incorporating 
scalar and vector capabilities and a large, fast bi-polar memory. Vector processing provides 
result rates greatly exceeding the result rates of conventional scalar processing. Several unique 
design features of the CRAY-1 provide formidable competition for other existing vector ma- 
chines. This paper describes the structure of the CRAY-1 and enumerates some of the im- 
portant unique design features which give the machine its power. The paper concludes with 
examples to illustrate the speed of the CRAY-1 in vector applications. 



Operating registers 

The primary operating registers of the CRAY-1 are the scalar and vector registers, called S 
and V registers, respectively. Each of the eight V registers has sixty-four elements. A scalar 
instruction may perform some function, such as addition, obtaining operands from two S reg- 
isters and entering the result into another S register. The analagous vector instruction performs 
the same function, obtaining each clock period (12.5 nanoseconds) a new pair of operands 
from two V registers. Results are entered into elements of another V register. The contents 
of the vector length (VL) register determines the number of operations performed by the 
vector instruction. Eight 24-bit A registers are used as address registers for memory references 
and as index registers. The A and S registers are each supported by sixty-four rapid-access 
temporary storage registers, called B and T registers, respectively. Data can be transferred 
between the A, B, S, T, or V registers and memory. 

Memory 

The CRAY-1 memory is constructed of bi-polar 1024-bit LSI chips. Up to one million 64-bit 
words are arranged in 16 banks with a bank cycle time of four clock periods. The short 
cycle time provides an extremely efficient random-access memory. One parity bit per word 
is maintained in 16 modules of the CPU. There is no inherent memory degradation for ma- 
chines with less than one million words of memory. 

Instruction buffers 

All instructions, which may be 16 or 32 bits, are executed from four instruction buffers, each 
consisting of sixty-four 16-bit registers. Associated with each instruction buffer is a base ad- 
dress register that is used to determine if the current instruction resides in a buffer. Since the 
four instruction buffers are large, substantial program segments may reside in the buffers. 
Forward and backward branching within the buffers is possible and the program segments 
may be discontiguous. When the current instruction does not reside in a buffer, one of the 
instruction buffers is filled from memory. Four memory words are read per clock period to 
the least recently filled instruction buffer. To allow the current instruction to issue as soon 
as possible, the memory word containing the current instruction is among the first to be read. 

Input/Output 

There are twenty- four I/O channels, twelve of which are input and twelve output. Any num- 
ber of channels may be active at a given time. Each channel has a maximum transfer rate 
of 640 megabits. At most one 64-bit word per clock period can be transferred to or from 
memory and this is attained when four input channels and four output channels are simul- 
taneously operating at their maximum rate. In practice this theoretical transfer rate is limited 
by the speed of peripheral devices and by memory reference activity of the CPU. 
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Figure 1 . Registers block diagram 
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Functional units 

There are twelve functional units in the CPU. Each is a specialized unit implementing algo- 
rithms for a portion of the instructions. Each unit is independent of the other units and a 
number of functional units may be in operation at the same time. A functional unit receives 
operands from registers and delivers the result to a register when the function has been per- 
formed. There is no information retained in a functional unit for reference in subsequent 
instructions. These units operate essentially in three-address mode with limited source and 
destination addressing. 
Three functional units provide 24-bit results to the A registers only: 

— integer add 

— integer multiply 

— population count 

Three functional units provide 64-bit results to the S registers only: 

— integer add 

— shift 

— logical 

Three functional units provide 64-bit results to the V registers only: 

— integer add 

— shift 

— logical 

Three functional units provide 64-bit results to either the S or V registers: 

— floating add 

— floating multiply 

— reciprocal approximation 

Integer arithmetic is performed in 2's complement mode. Floating point quantities have signed 
magnitude representation. 

All functional units are fully segmented. This means that the information arriving at the unit, 
or moving within the unit, is captured and held in a new set of registers at the end of every 
clock period. It is therefore possible to start a new set of operands for unrelated computation 
into a functional unit each clock period even though the unit may require more than one clock 
period to complete the calculation. All functional units perform their algorithms in a fixed 
amount of time. No delays are possible once the operands have been delivered to the unit. 
Functional units servicing the vector instructions produce one result per clock period. 

Vector instructions 

Vector instructions may be classified into four types. One type of vector instruction obtains 
operands from one or two V registers and enters results into another V register (Fig. 2a). 
Successive operand pairs are transmitted from Vj and Vk to the segmented functional unit 
each clock period and the corresponding results emerge from the function unit n clock periods 
later, n is constant for a given functional unit and is called the functional unit time. Results 
are entered into result register Vi. The contents of the vector length (VL) register determines 
the number of operand pairs processed by the functional unit. 

The second type of vector instruction obtains one operand from an S register and one from 
a V register (Fig. 2b). A copy of the S register is transmitted to the functional unit with 
each V-register operand. 

The last two types of vector instruction transmit data between memory and the V registers 
(Fig. 2c and 2d). A path between memory and the V registers may be considered a func- 
tional unit for timing considerations. 





Figure 2a. Type I vector instruction 



Figure 2b. Type II vector instruction 





Figure 2c. Type III vector instruction 



Figure 2d. Type IV vector instruction 



It is important to understand functional unit segmentation, especially as it relates to execution 
of vector instructions. Let a particular element of a V register be specified by adding the 
element number as a subscript to the register name. For example, the first three elements of 
register VI are Vlo, Vli, and VI2, respectively. Since a vector register has sixty-four ele- 
ments, the last element of VI is Vies. Figure 3 shows a timing chart for execution of a 
floating point addition instruction. This instruction is type I since operands are obtained from 
two vector registers. When the instruction issues at clock period to, the first pair of elements 
(Vlo and V2o) is transmitted to the add functional unit where it arrives at clock period ti. 
The dashed lines indicate transit to and from the functional unit. The functional unit time 
for this unit is six clock periods, so the first result, which is the sum of Vlo and V2o, exits 
from the functional unit at clock period t?. The sum is transmitted to the first element of 
result register VO arriving at clock period ts. Because the functional unit is fully segmented, 
the second pair of elements (VI 1 and V2i) is transmitted to the add functional unit at clock 
period ti. At clock period t2 the functional unit is in the process of performing two additions 
simultaneously, since the addition of Vlo and V2o was begun in the previous clock period. 
The second result, which is the sum of VI 1 and V2i, is entered into the second element of 
result register VO at clock period ts. Continuing in this manner, a new pair of elements enters 
the functional unit each clock period and the corresponding sum emerges from the unit six 
clock periods later and is transmitted to the result register. Since a new addition is begun 
each clock period, six additions may be in progress at one time. In general, the number of 
operations which can be performed simultaneously by a functional unit is equal to the func- 
tional unit time. 
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Figure 3. Vector instruction timing example (VO «- VI + V2) 



The vector length determines the total number of operations performed by a functional unit. 
Although each vector register has sixty -four elements, only the number of elements specified 
by the vector length register is processed by a vector instruction. Vectors which have more 
than sixty-four elements are processed under program control, in groups of 64 (with a possi- 
ble residue). A later section on vector loops will illustrate the processing of long vectors. 



Functional unit and operand register reservations 

When a vector instruction issues, the required functional unit and the operand registers are re- 
served for the number of clock periods determined by the vector length. A subsequent vector 
instruction which requires the same functional unit or operand register cannot issue until the 
reservations are released. When two vector instructions use different functional units and vec- 
tor registers, they are independent and may issue in neighboring clock periods. Example la 
shows two independent instructions. Both execute concurrently with a one clock period differ- 
ence in their issue times. Examples lb- Id illustrate the effect of functional unit and operand 
register reservation when two instructions are not independent. Example lb shows two add 
instructions. When the first instruction issues, the floating add functional unit and operand 
registers VI and V2 are reserved. Issue of a second add instruction is delayed until the func- 
tional unit is free. Example lc shows an add instruction followed by a multiply instruction. 
As in the previous example, the floating add functional unit and operand registers VI and V2 
are reserved when the first instruction issues. Issue of the second instruction is delayed until 
the operand register VI is free. The second instruction in example Id is delayed because of 
both functional unit and operand register reservations. 

V0«-V1 + V2 V3*-V1 + V2 

V3<^V4*V5 V6*-V4+V5 

Ex. la. Independent Ex. lb. Functional unit 

instructions reservation 

V3<-V1 + V2 V0—V1 + V2 

V6^V1*V5 V3^V1 + V5 

Ex. lc. Operand register Ex. Id. Functional unit 

reservation and operand register 

reservation 

Result register reservations and chaining 

When a vector instruction issues, the result register is reserved for the number of clock periods 
determined by the vector length and functional unit time. This reservation allows the final 
operand pair to be processed by the functional unit and the corresponding result to be trans- 
mitted to the result register. 

A result register becomes the operand register of a succeeding instruction. In the process 
called "chaining", the succeeding instruction issues as soon as the first result arrives for use 
as an operand. This clock period is termed "chain slot time" and it occurs only once for each 
vector instruction. If the succeeding instruction cannot issue at chain slot time because of a 
prior functional unit or operand register reservation, then it must wait until the result register 
reservation is released. Figure 4a shows a chain of four instructions which read a vector of 
integers from memory, add that vector to another, shifting the sum, and finally forming the 
logical product of the shifted sum and a mask vector. The result of the four instructions is 
in vector register V5. Figure 4b graphically depicts the passage of information through the 
functional units. The diagram illustrates that the functional units may be considered links in 
a chain which works as a whole to produce the final result. 

VO - — memory (memory read) V3 ■* — V2<A3 (left shift) 

V2 - — V0+V1 (integer add) V5 ■* — V3&V4 (logical product) 

Figure 4a. Chaining example 




Figure 4b. Pictorial representation of chaining example 



The timing diagram in Figure 4c will help clarify the concept of chaining. Gradations along 
the horizontal axis represent clock periods. The memory read instruction issues at clock period 
to. Each horizontal line shows the production of one element of the V5 result vector. Time 
spent in passing through each of the four functional units used in the instruction sequence is 
indicated by bars of corresponding length. The bars are color-coded to agree with Figures 4a 
and 4b. Note that the production of a new element of V5 begins each clock period. Produc- 
tion of the first element of V5 begins at clock period to with the reading of the first word 
from memory, production of the second element of V5 begins at clock period h, with the 
reading of the second word from memory, and so on. The first result enters V5 at clock 
period 123 and a new result enters V5 each clock period thereafter. The first horizontal line, 
which shows the production of the first element of V5 (V5o), is reproduced below the timing 
diagram with segments lettered for identification. Chain slot times for each functional unit 
are marked by asterisks. A detailed description of the production of V5o may be illustrative; 
production of the other elements of the result vector is identical except for the staggered start 
times. 

The vector read instruction issues at clock period to. The first word arrives in element of 
register VO at clock period is, and is immediately transmitted along with element of regis- 
ter Vi, as an operand to the integer add functional unit. When the two operands arrive at the 
integer add functional unit at clock period to, the computation of the sum of VOo and VI 1 is 
begun. Three clock periods later (ti2) the sum is sent from the functional unit to element 
zero of V2. It arrives at clock period ti3 and is immediately transmitted as an operand to the 
shift functional unit. At clock period ti4 the operand arrives at the shift functional unit and 
the shift operation is begun. The operation is completed four clock periods later (tis) and the 
shifted sum is sent from the functional unit to element zero of V3, arriving the next clock 
period. It is immediately transmitted, along with element zero of V4, as an operand to the 
logical functional unit. When the two operands arrive at the logical functional unit at clock 
period 120, the computation of the logical product of V3o and V4o is begun. Two clock peri- 
ods later (122) the final result is sent from the functional unit to element zero of V5, arriving 
at clock period t23. While all this has been going on, production of the second element of 
V5 has been tracing the same path through the vector registers and functional units with a 
one clock period lag. Production of the third element of V5 lags one more clock period 
behind, and so on. A new result arrives at the V5 result register each clock period. 



t tj t 2 t 3 t 4 t 5 t 6 t 7 t 8 t 9 t 1Q t n t 12 t 13 t 14 t ls t 16 t 17 t 18 t 19 t 20 t 2 , t 22 t 23 t 24 t 2s t 26 t 2? t 28 t 29 t 30 t 3] 



■ V5, 



V5, 

V5. 



V5, 



. VS, 



■ V5 



5 

— V5 



6 

V5 



7 

— V5 



8 



«-»- 



^->— *- 



v 

a b c d e f g 

a — transit of memory word to "read functional unit" 

b — transit of memory word through "read functional unit" 

c — transit of memory word from "read functional unit" 
to element of VO 

d — transit of operand elements in VO and VI to integer add 
functional unit 

e — computation of sum by integer add functional unit 

f — transit of sum from integer add functional unit to element 
ofV2 



h 



J *-^- J 



i j k 1 

g — transit of operand element in V2 to shift functional unit 

h — shift operation performed by shift functional unit 

i — transit of shifted sum from shift functional unit to 
element of V3 

j - transit of operand elements in V3 and V4 to logical functional unit 

k — logical operation performed by logical functional unit 

1 - transit of final result to element of V5 



Figure 4c. Timing diagram for chaining example 



Vector loops 

Long vectors are processed in segments since the vector registers of the CRAY-1 cannot ac- 
commodate vectors with more than 64 elements. The program construct created to process 
long vectors is called a vector loop. Each pass through the loop processes a 64-element (or 
smaller) segment of the long vectors. The general procedure is to compute the loop count 
based on the vector length before entering the loop. Inside the loop the program takes full 
advantage of the twelve independent functional units and chaining to read the current vector 
segments from memory, execute the required functions and return the results to memory. 
Loop control is performed in the scalar registers concurrently with vector processing. Loop 
branch time is hidden by the vector operations. 

The following simple FORTRAN loop can be used to illustrate processing of long vectors: 

D0 100 I = I, N 
100 A(I) = 5.*B(I) + C 

When N is 64 or less, all elements of the A array can be assigned a value with a sequence 
of seven instructions: 

1. SI <— 5. set constants 

2. S2 <- C 

3. VL^ N set vector length 

4. V0 *— B array read B array ) 

5. VI <— S1*V0 multiply elements by 5 > chain 

6. V2<- S2 + V1 add C j 

7. A array <— V2 store A array 

Instructions 4-6 use different functional units ("memory", multiply and add, respectively) and 
so they can be chained. When the V2 result register is free, the results are stored in the A 
array. 

When N exceeds 64, a vector loop is required to generate the entire A array. Before entering 
the loop, N is divided by 64 to determine the loop count. If there is a remainder, less than 
64 elements of A are generated in the first pass through the loop. The loop performs instruc- 
tions 4-7 for a segment of the A and B arrays. The last vector operation at the bottom of 
the loop is to store the current segment of the A array in memory. This operation must be 
completed before the next segment of the B array can be read in the next pass through the 
vector loop. The time required to decrement the loop counter, increment the current position 
in the arrays and branch to the top of the loop is hidden because it is done in parallel with 
the store operation. 

The concept of chaining and the use of register-to-register vector instructions are unique to 
the CRAY-1. The design minimizes the problem of speed degradation associated with mem- 
ory-to-memory vector instructions. Start-up time for vector operations in the CRAY-1 are 
nominal; the benefits of vector processing are visible even for short vectors. The extensive 
concurrency attainable through the use of the 12 fully-segmented functional units can produce 
impressive results. A performance study of several subroutines for the CRAY-1 FORTRAN 
library and matrix multiplication illustrates this fact. 

Fortran library 

Each scalar FORTRAN library subroutine has a vector analog which employs the same algo- 
rithm in vector mode to produce several results at a time. The scalar subroutines must be 
called for each desired result, while the vector subroutines process an argument vector to 
obtain a vector of results. Performance studies on the CRAY-1 indicate that the vector sub- 
routine outperforms its scalar counterpart whenever a vector of two or more results is required. 
Figure 5 depicts the behavior of the scalar and vector subroutines for several library functions. 
The cost of a result (in clock periods) is graphed as a function of vector length. The cost is 
constant for scalar subroutines since they must be called for each desired result; however, for 
vector subroutines the cost drops dramatically and rapidly approaches a lower limit as vector 
length increases. In all cases the vector cost is less than the scalar cost when more than one 
result is produced. 
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Matrix multiplication 

Let [X] denote a matrix and let the element in row i, column j be denoted by xy. Given 
matrix [A] of dimension K by N and the matrix [B] of dimension N by M, the product 
matrix [C] = [A] ■ [B] is defined by 

N 

Cij = y 3in ' bnj 

n=l 

Calculation of the product matrix is amenable to vector processing. The combination of mul- 
tiplication and addition lends itself well to chaining. Figure 6 shows the CRAY-1 execution 
rate for multiplication of square matrices as a function of matrix dimension. The execution 
rate is defined in terms of "millions of floating point operations per second" (MFLOPs). This 
measure is more meaningful than the classical "millions of instructions per second" (MIPs), 
especially when comparing relative speeds of scalar and vector machines; a single vector in- 
struction is equivalent to a loop of several scalar instructions. The number of floating point 
operations required to multiply two n-dimensional square matrices is n 2 -(2n-l), since each of 
the n 2 elements of the result matrix is, formed by summing n products. 
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Matrix multiplication is typical of the large class of problems which can be vectorized. For 
these problems a significant increase in processing speed can be achieved over conventional 
scalar processing. Register-to-register vector instructions and the large amount of concurrency 
attainable through use of the twelve independent functional units and chaining can provide 
processing speeds exceeding those of other existing vector machines. Fields such as weather 
forecasting, nuclear research and seismic analysis will find the CRAY-1 a welcome extension 
to their workshops of computing tools. 
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